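360$^\circ$ video saliency detection is one of the challenging benchmarks for 360$^\circ$ video understanding, because non-negligible distortion and discontinuity occur in every projection format of 360$^\circ$ video, and the capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using a Vision Transformer with deformable convolution, which lets us plug models pretrained on normal videos into our architecture without additional modules or fine-tuning, and also lets us perform the geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn saliency from three simple relative relations among local patch features, outperforming state-of-the-art models on the Wild360 benchmark by large margins without supervision or auxiliary information such as class activation. We demonstrate the utility of our saliency prediction model on the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.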
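Most neural text-to-speech (TTS) models require <speech, transcript> paired data from the desired speaker for high-quality speech synthesis, which limits the use of large amounts of untranscribed data for training. In this work, we present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for text-to-speech. By modeling the unconditional distribution of speech, our model can use untranscribed data for training. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms from the conditional distribution given a transcript. We show that Guided-TTS achieves performance comparable to existing methods without any transcript for LJSpeech. Our results further show that a single speaker-dependent phoneme classifier trained on large-scale multi-speaker data can guide unconditional DDPMs for various speakers to perform TTS.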
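The guidance idea in this abstract rests on Bayes' rule: the score of the conditional distribution equals the unconditional score plus the gradient of the classifier's log-probability. The following is a minimal one-dimensional sketch of that identity, not the Guided-TTS implementation. The two-class Gaussian mixture, the constant `MU`, and all function names are hypothetical and chosen only so that every term has a closed form.

```python
import math

MU = 2.0  # hypothetical: the two classes are N(-MU, 1) and N(+MU, 1), equally likely


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unconditional_score(x):
    # d/dx log(0.5 * N(x; -MU, 1) + 0.5 * N(x; +MU, 1))
    return -x + MU * math.tanh(MU * x)


def classifier_log_prob_grad(x):
    # d/dx log p(y = +1 | x), where the exact posterior is sigmoid(2 * MU * x)
    return 2.0 * MU * (1.0 - sigmoid(2.0 * MU * x))


def guided_score(x):
    # Classifier guidance: unconditional score plus classifier gradient
    return unconditional_score(x) + classifier_log_prob_grad(x)


def conditional_score(x):
    # d/dx log N(x; +MU, 1), the score of the target class itself
    return -(x - MU)


for x in (-3.0, -0.5, 0.0, 1.7, 4.0):
    print(x, guided_score(x), conditional_score(x))
```

In this toy setting the guided score matches the class-conditional score at every point, which is why an unconditional model plus a separately trained classifier can, in principle, sample from the conditional distribution.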
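Convolution has arguably been the most important feature transform for modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to the era of dynamic feature transforms. However, existing dynamic transforms, including self-attention, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.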
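Recent studies have determined that the learned token embeddings of large-scale neural language models degenerate to be anisotropic with a narrow-cone shape. This phenomenon, called the representation degeneration problem, increases the overall similarity between token embeddings, which negatively affects model performance. Although existing methods that address the degeneration problem based on observations of the phenomenon it triggers improve text generation performance, the training dynamics of token embeddings behind the degeneration problem remain unexplored. In this study, we analyze the training dynamics of token embeddings, focusing on rare token embeddings. We demonstrate that a specific part of the gradient for rare token embeddings is the key cause of the degeneration problem for all tokens during the training stage. Based on this analysis, we propose a novel method called adaptive gradient gating (AGG). AGG addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings. Experimental results on language modeling, word similarity, and machine translation tasks verify the effectiveness of AGG both quantitatively and qualitatively.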
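The "overall similarity between token embeddings" mentioned above can be illustrated with one simple diagnostic, the mean pairwise cosine similarity. The sketch below is only that diagnostic and does not implement AGG. The two toy embedding sets and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    # Average cosine similarity over all unordered pairs of embeddings
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings: directions spread around the circle...
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
# ...versus directions squeezed into a narrow cone.
narrow_cone = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1], [0.9, 0.05]]

print(mean_pairwise_cosine(spread_out))
print(mean_pairwise_cosine(narrow_cone))
```

A value close to 1 for the second set reflects the narrow-cone geometry the abstract describes, while the spread-out set averages to a value at or below zero.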
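Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most existing unsupervised conditional GANs cannot properly cluster the attributes of such data in their latent spaces, because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization, which provides reparameterizable gradient estimates of the latent distribution parameters, assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and a novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method remains effective even in the absence of attribute information. Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.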
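Spatio-temporal convolution often fails to learn motion dynamics in videos, so video understanding in the wild requires an effective motion representation. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as its similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient neighborhood volume in space and time, it effectively captures long-term interactions and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous motion modeling methods as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks Something-Something-V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.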
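The description of self-similarity above, where each local region is represented by its similarities to neighbors in space and time, can be sketched on a toy clip. This is a simplified illustration and not the SELFY block. It assumes one spatial dimension, cosine similarity, zero padding outside the clip, and hypothetical names throughout.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def self_similarity(features, max_dt=1, max_dx=1):
    """features[t][x] is a feature vector; returns sim[t][x][(dt, dx)]."""
    num_frames, width = len(features), len(features[0])
    offsets = [(dt, dx)
               for dt in range(-max_dt, max_dt + 1)
               for dx in range(-max_dx, max_dx + 1)]
    volume = []
    for t in range(num_frames):
        row = []
        for x in range(width):
            sims = {}
            for dt, dx in offsets:
                t2, x2 = t + dt, x + dx
                if 0 <= t2 < num_frames and 0 <= x2 < width:
                    sims[(dt, dx)] = cosine(features[t][x], features[t2][x2])
                else:
                    sims[(dt, dx)] = 0.0  # zero padding outside the clip
            row.append(sims)
        volume.append(row)
    return volume


# Toy clip: a distinctive feature moves one position to the right per frame.
background, moving = [1.0, 0.0], [0.0, 1.0]
clip = [
    [moving, background, background],
    [background, moving, background],
    [background, background, moving],
]
stss = self_similarity(clip)
# The peak for the region at (t=0, x=0) sits at offset (dt=+1, dx=+1).
print(stss[0][0][(1, 1)], stss[0][0][(1, 0)])
```

The location of the similarity peak across temporal offsets encodes the displacement of the region, which is the sense in which appearance features are converted into relational values.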
Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators. Although these operators provide flexibility to the model with their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy of the ViT layers, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called lightweight structure-aware attention (LiSA), which has a better representation power with log-linear complexity. Our operator learns structural patterns by using a set of relative position embeddings (RPEs). To achieve log-linear complexity, the RPEs are approximated with fast Fourier transforms. Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform self-attention and other existing operators, achieving state-of-the-art results on ImageNet, and competitive results on other visual understanding benchmarks such as COCO and Something-Something-V2. The source code of our approach will be released online.
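The abstract does not say how LiSA applies fast Fourier transforms, so the sketch below only illustrates the general trick that makes log-linear cost possible: mixing a sequence with weights that depend on relative position alone is a circular convolution, which an FFT evaluates in O(n log n) instead of O(n^2). The weights `rel`, the helper names, and the power-of-two length restriction are assumptions made for this example.

```python
import cmath


def fft(a, invert=False):
    # Recursive radix-2 FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def mix_direct(rel, x):
    # y[i] = sum_j rel[(i - j) mod n] * x[j], quadratic in n
    n = len(x)
    return [sum(rel[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def mix_fft(rel, x):
    # Same result via the convolution theorem, log-linear in n
    n = len(x)
    product = [p * q for p, q in zip(fft(rel), fft(x))]
    return [v.real / n for v in fft(product, invert=True)]


# Hypothetical relative-position weights: each position attends to itself
# and to its two circular neighbours.
rel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(mix_direct(rel, x))
print(mix_fft(rel, x))
```

Both functions return the same values up to floating-point error; only the cost differs as the sequence length grows.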
3D-aware image synthesis aims to preserve spatial consistency while generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable results, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated, photorealistic, 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fréchet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a resolution of $128^{2}$. Additionally, because $\text{C}^{3}$G-NeRF can synthesize class-conditional images, we report FIDs of the generated 3D-aware images for each class of the datasets.
In both terrestrial and marine ecology, physical tagging is a frequently used method to study population dynamics and behavior. However, such tagging techniques are increasingly being replaced by individual re-identification using image analysis. This paper introduces a contrastive learning-based model for identifying individuals. The model uses the first parts of the Inception v3 network, supported by a projection head, and we use contrastive learning to find similar or dissimilar image pairs from a collection of uniform photographs. We apply this technique for corkwing wrasse, Symphodus melops, an ecologically and commercially important fish species. Photos are taken during repeated catches of the same individuals from a wild population, where the intervals between individual sightings might range from a few days to several years. Our model achieves a one-shot accuracy of 0.35, a 5-shot accuracy of 0.56, and a 100-shot accuracy of 0.88, on our dataset.
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
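The greedy policy described above picks, at each step, the unobserved feature with the highest mutual information with the label, conditioned on the feature values seen so far. The sketch below estimates that quantity from empirical counts on a tiny discrete dataset. It stands in for the oracle-style greedy rule only and is not the paper's learned, amortized method. The dataset and function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical I(A; B) in nats from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in joint.items())


def greedy_next_feature(rows, labels, observed):
    """observed maps feature index -> value already queried for this sample."""
    # Conditioning on the observed values = restricting to matching rows.
    matching = [(r, y) for r, y in zip(rows, labels)
                if all(r[i] == v for i, v in observed.items())]
    candidates = [i for i in range(len(rows[0])) if i not in observed]
    scores = {i: mutual_information([(r[i], y) for r, y in matching])
              for i in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical toy task: the label copies feature 1 when feature 0 is 0,
# and copies feature 2 when feature 0 is 1.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [r[1] if r[0] == 0 else r[2] for r in rows]

print(greedy_next_feature(rows, labels, {0: 0})[0])
print(greedy_next_feature(rows, labels, {0: 1})[0])
```

The best next query depends on what has already been observed, which is the difference between dynamic selection and a static feature subset. Count-based estimates like this need the full data distribution, which motivates the learned approximation the abstract proposes.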